Objective

Using machine learning, create a model to predict the manner in which a subject performs an exercise.

This report covers exploratory data analysis of the training set, the fitting of two models, and cross validation of those models.

The dataset contains measurements from accelerometers on the belt, arm, forearm and dumbbell of 6 different participants. Each participant performed the exercise in 5 different ways, labelled A through E: one correct execution (A) and four common mistakes (B to E).

This report briefly summarizes the approaches taken to understand, explore and simplify the dataset, as well as the models fitted. It fits both a random forest and a gradient boosting model, cross-validates each on a held-out test set, and performs the final validation on the 20 provided samples. The final model used was a random forest with close to 99% accuracy.

Data Exploration

The data was first loaded into R. Any values recorded as "NA", empty strings (""), or the Microsoft Excel expression "#DIV/0!" (indicating division by zero) were set to NA. A seed (36362) was set for reproducibility.
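As a minimal illustration of how the na.strings argument works (using a small made-up in-memory CSV, not the real file):

```r
# Toy example: read.csv converts each string listed in na.strings to NA on import
txt <- "a,b\n1,#DIV/0!\n,2"
df <- read.csv(text = txt, na.strings = c("NA", "", "#DIV/0!"))
# Both the "#DIV/0!" in column b and the empty string in column a become NA
```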

As this is an assignment, and in the interests of time and sanity, I will use a simple training vs. test set split for cross validation. In real life I would ideally do a more extensive cross validation (maybe k-fold), but I am working on my laptop and only have so much patience and computing power. I also suspect random forests or bagging will give the best predictions, and they tend to take forever to train.
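For reference, the k-fold alternative is straightforward to request from caret; a sketch of what that would look like (not run here, since I am sticking with the single hold-out split):

```r
library(caret)

# Ask caret for 5-fold cross-validation instead of its default bootstrap resampling
ctrl <- trainControl(method = "cv", number = 5)

# This would then be passed to train(), e.g.:
# mdl <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)
```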

library(caret)
library(tidyverse)
library(factoextra)
library(corrplot)
library(reshape2)
library(ggpubr)
library(randomForest)

The data was read in and randomly split into two separate dataframes: 70% training and 30% test.

dat <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"), row.names = 1)
set.seed(36362)

classe <- dat$classe

datpart <- createDataPartition(classe, p = 0.7, list = FALSE)

training <- dat[datpart,]
testing <- dat[-datpart,]

I then looked at the structure and the columns of the dataset (truncated for display). The first 6 variables identify the subject rather than the movement (the user's name, the time they did the exercise, etc.). I specifically do not want to use these, as my aim is to predict based on the movement; including the name (as you will see later) would make the model useless for any new subjects. I put the ID columns, along with the classe variable, into their own dataframe. I then removed any variables with near-zero variance and any columns containing NAs, bringing the number of candidate predictors down from the original 159 to 52.

str(training, list.len = 10)
## 'data.frame':    13737 obs. of  159 variables:
##  $ user_name               : chr  "carlitos" "carlitos" "carlitos" "carlitos" ...
##  $ raw_timestamp_part_1    : int  1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
##  $ raw_timestamp_part_2    : int  788290 120339 304277 484323 484434 528316 576390 604281 732306 740353 ...
##  $ cvtd_timestamp          : chr  "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
##  $ new_window              : chr  "no" "no" "no" "no" ...
##  $ num_window              : int  11 12 12 12 12 12 12 12 12 12 ...
##  $ roll_belt               : num  1.41 1.48 1.45 1.43 1.45 1.43 1.42 1.45 1.55 1.57 ...
##  $ pitch_belt              : num  8.07 8.05 8.06 8.16 8.17 8.18 8.21 8.2 8.08 8.06 ...
##  $ yaw_belt                : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt        : int  3 3 3 3 3 3 3 3 3 3 ...
##   [list output truncated]
table(training$classe)
## 
##    A    B    C    D    E 
## 3906 2658 2396 2252 2525
ID_training <- training[,c(1:6, ncol(training))]

NZV <- nearZeroVar(training)

training2 <- training[,-c(1:4, 6, NZV, ncol(training))]

ISNA <- is.na(colSums(training2))


training3 <- training2[,which(ISNA == FALSE)]
training3$classe <- classe[datpart]

With the remaining variables, I made some plots to look at how they interact with each other and with the classe variable. First I plotted the density of observations for each class across all of the variables, to see whether any distributions differ obviously between classes. The figure is large, but we can clearly see that class A has differing distributions in a number of the variables, including some with large peaks compared to the other classes. The same is true for the other classes; for example class "E", which refers to throwing the hips forward, has a different distribution in many of the variables detected by the belt accelerometer.

melt.train <- cbind(user = ID_training$user_name, classe = ID_training$classe, training3) %>% 
    melt(id.vars = c("user", "classe"))


ggplot(melt.train, aes(x = value, colour = classe)) + 
    geom_density(lwd = 2) + facet_wrap(~variable, scales = "free", ncol = 6) + theme_bw() +
    theme(strip.text = element_text(size = 16))

I then computed a correlation matrix to see how the variables correlate with each other.

cor.1 <- cor(training3[,-ncol(training3)])

corrplot(cor.1, type = "upper", order = "hclust", tl.cex = 0.8)
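A common follow-up (not applied here) is to drop one variable from each highly correlated pair with caret::findCorrelation; a sketch on made-up data:

```r
library(caret)

# Toy data: y is almost an exact copy of x, z is independent noise
set.seed(10)
x <- rnorm(100)
toy <- data.frame(x = x, y = x + rnorm(100, sd = 0.01), z = rnorm(100))

# Flag column indices whose pairwise correlation exceeds the cutoff
drop_idx <- findCorrelation(cor(toy), cutoff = 0.9)
reduced <- toy[, -drop_idx]  # keeps one of the x/y pair, plus z
```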

Next I performed a PCA on the unscaled data and looked at how the observations distribute according to the user and the type of exercise performed. They do not cluster cleanly by either class or user name. The variables that contribute the most to the first two dimensions are also shown.

prin.comp.1 <- prcomp(training3[,-ncol(training3)])

f1 <- fviz_pca_ind(prin.comp.1, 
                   geom = "point", habillage = ID_training$classe, 
                   addEllipses = TRUE,
                   palette = "jco",
                   axes = 1:2)

f2 <- fviz_pca_ind(prin.comp.1,
                   geom = "point", habillage = ID_training$user_name,
                   addEllipses = TRUE,
                   palette = "jco",
                   axes = 1:2)

f3 <- fviz_contrib(prin.comp.1, choice = "var", axes = 1:2)

ggarrange(f1, f2, f3, nrow = 1)

Interestingly, when the same PCA is performed on scaled data, the observations divide into clear clusters, but primarily by user rather than by class. The structure of this PCA plot and the way it clusters gives me the impression that a PCA transformation may not be the most helpful here. In the future, one idea might be to find a transformation that separates the classes rather than the users.
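The effect of scaling can be seen on a toy example: when one variable's variance dwarfs the others, unscaled PCA is dominated by that variable, while scaling lets every variable contribute:

```r
# Toy data: x has roughly 10,000x the variance of y
set.seed(1)
toy <- data.frame(x = rnorm(200, sd = 100), y = rnorm(200, sd = 1))

p_raw <- prcomp(toy)                 # unscaled: PC1 is essentially just x
p_sc  <- prcomp(toy, scale. = TRUE)  # scaled: both variables contribute

# Proportion of total variance captured by PC1 in each case
pv_raw <- p_raw$sdev[1]^2 / sum(p_raw$sdev^2)
pv_sc  <- p_sc$sdev[1]^2 / sum(p_sc$sdev^2)
```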

prin.comp.2 <- prcomp(training3[,-ncol(training3)], scale = TRUE, center = TRUE)

f4 <- fviz_pca_ind(prin.comp.2, 
                   geom = "point", habillage = ID_training$classe, 
                   addEllipses = TRUE,
                   palette = "jco",
                   axes = 1:2)

f5 <- fviz_pca_ind(prin.comp.2,
                   geom = "point", habillage = ID_training$user_name,
                   addEllipses = TRUE,
                   palette = "jco",
                   axes = 1:2)

f6 <- fviz_contrib(prin.comp.2, choice = "var", axes = 1:2)

ggarrange(f4, f5, f6, nrow = 1)

Modelling

The first model tested is a random forest; these tend to perform well but can be quite slow to train. It is built with 500 trees using the 52 variables from the training set.

set.seed(1255)
mdl1 <- train(classe ~ ., training3, method = "rf")

The results of the model build can be seen below. The best tune is mtry = 27 (the number of variables randomly sampled at each split), with an accuracy of almost 99% on the training resamples.

mdl1$bestTune
##   mtry
## 2   27
mdl1$results
##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9878217 0.9845877 0.001521653 0.001924103
## 2   27 0.9879447 0.9847439 0.001896103 0.002399309
## 3   52 0.9800302 0.9747279 0.004635280 0.005858635
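A direct randomForest call with a fixed mtry and 500 trees looks like this (sketched on the built-in iris data rather than the assignment data):

```r
library(randomForest)

# Fit a 500-tree forest, sampling 2 predictors at each split
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)
```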

The confusion matrix for this model on the testing data is displayed below. The test-set accuracy is very similar to the training estimate, around 99%. (It is, strangely, slightly higher; there may be some overfitting here, but the two are very close.)

pred1 <- predict(mdl1, testing)

c1 <- confusionMatrix(pred1, as.factor(testing$classe))

c1$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9926933      0.9907573      0.9901704      0.9947072      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
c1$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    9    0    0    0
##          B    3 1128    4    0    0
##          C    1    2 1019   13    1
##          D    0    0    3  951    6
##          E    1    0    0    0 1075
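From the test-set accuracy above (0.9927), the estimated out-of-sample error is about 0.7%. In general this is just one minus the accuracy computed from the confusion matrix; a generic sketch with made-up counts:

```r
# Made-up 2-class confusion matrix (rows = predicted, cols = reference)
cm <- matrix(c(50,  2,
                1, 47),
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("A", "B"), ref = c("A", "B")))

acc <- sum(diag(cm)) / sum(cm)  # overall accuracy: correct / total
oos_error <- 1 - acc            # estimated out-of-sample error rate
```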

Following this, I fit a gradient boosting model, again using all 52 variables.

set.seed(1784)
mdl2 <- train(classe ~., training3, method = "gbm", verbose = FALSE)

The accuracy on the training resamples is slightly lower (about 95.7%) than for the random forest.

mdl2$results
##   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa
## 1       0.1                 1             10      50 0.7511188 0.6844542
## 4       0.1                 2             10      50 0.8553524 0.8167609
## 7       0.1                 3             10      50 0.8945090 0.8664837
## 2       0.1                 1             10     100 0.8201878 0.7724677
## 5       0.1                 2             10     100 0.9039111 0.8784152
## 8       0.1                 3             10     100 0.9376546 0.9211201
## 3       0.1                 1             10     150 0.8514350 0.8120282
## 6       0.1                 2             10     150 0.9276043 0.9084021
## 9       0.1                 3             10     150 0.9566160 0.9451137
##    AccuracySD     KappaSD
## 1 0.007854333 0.009873741
## 4 0.004716541 0.005913790
## 7 0.004794369 0.006074506
## 2 0.006556047 0.008214343
## 5 0.004786531 0.006021602
## 8 0.003627996 0.004611101
## 3 0.005927030 0.007478725
## 6 0.004059517 0.005117935
## 9 0.003122326 0.003965639
mdl2$bestTune
##   n.trees interaction.depth shrinkage n.minobsinnode
## 9     150                 3       0.1             10

After building a confusion matrix, performance on the test set is very similar (96.2%).

pred2 <- predict(mdl2, testing)

c2 <- confusionMatrix(pred2, as.factor(testing$classe))

c2$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   9.624469e-01   9.524953e-01   9.572700e-01   9.671586e-01   2.844520e-01 
## AccuracyPValue  McnemarPValue 
##   0.000000e+00   3.051514e-05
c2$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1643   37    0    0    2
##          B   21 1065   28    4   11
##          C    6   35  986   27    5
##          D    3    1   11  925   19
##          E    1    1    1    8 1045

Final Validation

Finally, we read in the 20 samples from pml-testing.csv and predict them with the random forest model, selected because it has the best accuracy. Out of curiosity I also ran the boosted model, and the two sets of predictions agree. For the final answers I will use the random forest model: although I think it may be slightly overfit, it is still the better model.

dat2 <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"), row.names = 1)

pred.Final <- predict(mdl1, dat2)
pred.Final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
pred.Final2 <- predict(mdl2, dat2)
pred.Final2
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E